What is Statistical Learning?


  • Depicted above are Sales vs TV, Radio, and Newspaper advertising budgets, with a blue linear-regression line fit separately to each

  • Do you think that we could predict Sales using these three advertising budgets?

  • Perhaps we could utilize a model to help us

    • \(Sales \approx f(TV, Radio, Newspaper)\)

    • Where Sales is a response (dependent) variable that we want to predict

      • We generally refer to this as Y


    • TV, Radio, and Newspaper are features (independent variables) that we name \(X_1, X_2, X_3\), respectively

      • We can refer to the input vector collectively as: \(X = \begin{pmatrix} X_1\\ X_2\\ X_3 \end{pmatrix}\)

      • This allows us to write the model as: \(Y = f(X) + \epsilon\)

        • Where \(\epsilon\) captures measurement errors and other discrepancies not directly modeled
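
  • As a concrete illustration, the minimal sketch below simulates data from \(Y = f(X) + \epsilon\); the form of \(f\), its coefficients, and the noise level are all invented for illustration, not estimated from the real Advertising data

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic advertising budgets (the real Advertising data would
# normally be loaded from a CSV file)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# A hypothetical "true" f, chosen only for illustration
def f(tv, radio, newspaper):
    return 3.0 + 0.045 * tv + 0.19 * radio + 0.0 * newspaper

eps = rng.normal(0, 1.5, n)            # error term: mean zero, independent of X
sales = f(tv, radio, newspaper) + eps  # Y = f(X) + epsilon
```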

What is \(f(X)\) Good For?

\(Y = f(X) + \epsilon\)

  • Here \(f\) is some fixed but unknown function of \(X_1,...,X_p\) and \(\epsilon\) is an error term which is independent of \(X\) and has mean zero

  • In this formulation, \(f\) represents the systematic information that \(X\) provides about \(Y\)

  • The two main reasons to estimate \(f\) are prediction and inference

  • Prediction

    • With a good estimate \(\hat{f}\) we can make predictions of \(Y\) at new points \(X = x\)


  • Inference

    • Additionally, we can understand which components of \(X = (X_1, X_2, ..., X_p)\) are important in explaining \(Y\), and which are irrelevant (e.g., Seniority and years of education have a big impact on Income, but Marital Status typically does not)

Prediction

  • In many situations, a set of inputs \(X\) is readily available, but the output \(Y\) cannot be easily obtained

  • We can make predictions with the following: \(\hat{Y} = \hat{f}(X)\)

    • Where \(\hat{f}\) represents our estimate of \(f\) and \(\hat{Y}\) represents the resulting prediction for \(Y\)

      • When trying to create the most accurate prediction of \(Y\), we may not be as concerned with the exact form of \(\hat{f}\)


  • Example:

    • Suppose that \(X_1, \dots, X_p\) are characteristics of a patient’s blood sample that can be easily measured in a lab, and \(Y\) is a variable encoding the patient’s risk for a severe adverse reaction to a particular drug. It is natural to seek to predict \(Y\) using \(X\), since we can then avoid giving the drug to patients who are at higher risk of an adverse reaction


  • The accuracy of \(\hat{Y}\) as a prediction for \(Y\) depends on the following quantities:

    • Reducible Error

      • Usually, \(\hat{f}\) will not be a perfect estimate for \(f\), and this inaccuracy will introduce some error, which is called reducible error. It is reducible because we can potentially improve the accuracy of \(\hat{f}\) by using the most appropriate statistical learning technique to estimate \(f\)


    • Irreducible Error

      • Even if we had a perfect estimate for \(f\), so that our estimated response took the form \(\hat{Y} = f(X)\), our prediction would still have some error in it because \(Y\) is also a function of \(\epsilon\), which by definition cannot be predicted using \(X\). This variability associated with \(\epsilon\) also affects the accuracy of our predictions, and it is irreducible because no matter how well we estimate \(f\), we cannot reduce the error introduced by \(\epsilon\)


  • Our focus will be on techniques for estimating \(f\) that minimize the reducible error (see the sketch below)

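  • A minimal simulation of this split on synthetic data, using an invented linear \(f\) and noise level: even a perfectly specified model cannot beat the noise floor \(Var(\epsilon)\)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(0, 10, size=(n, 1))
f_true = 2.0 + 0.5 * x[:, 0]        # hypothetical true f (linear here)
y = f_true + rng.normal(0, 1.0, n)  # irreducible noise with Var(eps) = 1

y_hat = LinearRegression().fit(x, y).predict(x)

# Reducible + irreducible error of the fitted model
print("MSE of fitted f-hat:", np.mean((y - y_hat) ** 2))
# Error of the *true* f: the irreducible floor, approx Var(eps) = 1
print("MSE of the true f:  ", np.mean((y - f_true) ** 2))
```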

Inference

  • There are times when we are interested in understanding the association (relationship) between \(Y\) and \(X_1,...,X_p\)

  • We will estimate \(f\), but our goal is not necessarily to make predictions for \(Y\); therefore, we need a better understanding of the exact form of \(\hat{f}\)

  • We may be interested in answering the following questions:

    • Which predictors are associated with the response?

    • What is the relationship between the response and each predictor?

    • Can the relationship between \(Y\) and each predictor be adequately summarized using linear regression, or is the relationship more complicated?

  • Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating \(f\) may be appropriate

    • Linear models provide relatively simple and interpretable inference, but may not yield as accurate predictions as other approaches

    • Alternatively, other non-linear approaches can provide highly accurate predictions of \(Y\), but less interpretable models for inference


How Do We Estimate \(f\)?

  • Parametric Methods

    • We make an assumption about the functional form, or shape, of \(f\)

      • The linear model is a parametric model, where we assume the functional form is linear

        • \(f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p\)


  • After a model has been selected, we will use training data to fit or train the model

    • A common approach for fitting a linear model is referred to as ordinary least squares (a minimal sketch follows this list)


  • Parametric models reduce the problem of estimating \(f\) down to estimating a set of parameters

  • A potential negative of parametric models is that the model we choose will usually not match the true unknown form of \(f\)

    • We can try to correct for this by utilizing more flexible models that can fit many different possible functional forms of \(f\)

      • A downside to more flexible models is the increased number of parameters that must be estimated, which can lead to overfitting the data, meaning our model follows the errors (random noise) rather than the underlying structure in the data

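  • A minimal ordinary-least-squares sketch on synthetic data; the true coefficients and noise level here are invented for illustration

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))                 # columns play the roles of X1..Xp
beta = np.array([2.0, -1.0, 0.5])           # hypothetical true coefficients
y = 1.0 + X @ beta + rng.normal(0, 0.5, n)  # linear f plus noise

ols = LinearRegression().fit(X, y)          # ordinary least squares fit
print("estimated intercept:", ols.intercept_)  # estimate of beta_0
print("estimated slopes:   ", ols.coef_)       # estimates of beta_1..beta_p
```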

  • Non-Parametric Methods

    • Non-parametric methods do not make explicit assumptions about the functional form of \(f\)

    • Instead they seek an estimate of \(f\) that gets as close to the data points as possible without being too rough or wiggly

    • Advantages

      • By avoiding assumptions pertaining to the functional form of \(f\), they can potentially more accurately fit a wider range of possible shapes for \(f\)


    • Disadvantages

      • Since they do not reduce the problem of estimating \(f\) to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for \(f\)
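
  • One concrete non-parametric estimator is \(k\)-nearest-neighbors regression (standing in here for methods like the thin-plate spline); the sketch below fits it to synthetic data with an invented sine-shaped \(f\)

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, 150)  # nonlinear f, no form assumed

# k controls smoothness: small k tracks the data closely (rough/wiggly),
# large k averages over many neighbors (smoother)
knn = KNeighborsRegressor(n_neighbors=9).fit(x, y)
grid = np.linspace(0, 10, 5).reshape(-1, 1)
print(knn.predict(grid))                       # estimate of f at a few points
```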

Prediction Accuracy vs Model Interpretability

  • Linear models are easy to interpret, while thin-plate splines are not

  • Good fit vs. over-fit or under-fit

    • How do we know when the fit is just right?


  • Parsimony vs. black-box

    • We often prefer a simpler model involving fewer variables over a black-box predictor involving them all

Assessing Model Accuracy

  • In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data

  • In the regression setting, the most commonly-used measure is the mean squared error (MSE)

    • \(MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2\)

      • Where \(\hat{f}(x_i)\) is the prediction that \(\hat{f}\) gives for the \(i\)th observation

      • The MSE will be small if the predicted responses are very close to the true responses and large if the predicted and true responses differ substantially


  • Suppose we fit a model \(\hat{f}(x)\) to some training data, \(Tr = \{(x_i, y_i)\}_{i=1}^{n}\), and wish to see how well it performs

    • We could compute the average squared prediction error over \(Tr\)

    • \(MSE_{Tr} = \mathrm{Ave}_{i \in Tr}[y_i-\hat{f}(x_i)]^2\)


  • This may be biased toward more overfit models

  • We are not overly concerned with how well our method works on the training data. Instead, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data

  • Therefore, we can compute the MSE on the test data \(Te = \{(x_i, y_i)\}_{i=1}^{m}\)

    • \(MSE_{Te} = \mathrm{Ave}_{i \in Te}[y_i-\hat{f}(x_i)]^2\)
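
  • The sketch below illustrates this gap on synthetic data (the sine-shaped \(f\), noise level, and polynomial degrees are invented): training MSE keeps falling as flexibility grows, while test MSE typically rises once the fit becomes too flexible

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * x[:, 0]) + rng.normal(0, 0.3, 200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):  # increasing model flexibility
    fit = make_pipeline(PolynomialFeatures(degree),
                        LinearRegression()).fit(x_tr, y_tr)
    mse_tr = np.mean((y_tr - fit.predict(x_tr)) ** 2)  # MSE over Tr
    mse_te = np.mean((y_te - fit.predict(x_te)) ** 2)  # MSE over Te
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```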

Bias-Variance Trade-off

  • Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set

    • Since the training data are used to fit the statistical learning method, different training data sets will result in a different \(\hat{f}\). But ideally the estimate of \(f\) should not vary too much between training sets


  • Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model

    • Think of the linear model: it assumes a linear relationship, but it is unlikely that any real-life problem is truly a simple linear relationship, so some bias is almost always introduced


  • As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases
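
  • For a given test point \(x_0\), the standard bias-variance decomposition makes these pieces explicit:

    • \(E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)\)

      • Since all three terms are non-negative, the expected test MSE can never fall below the irreducible error \(\mathrm{Var}(\epsilon)\)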